Introduction

Phenotype prediction has been performed using genes that are differentially expressed (DE genes) between conditions or cell types. Here we present a method that predicts cell types from a dimension reduction of single-cell gene expression data.

HiPSC dataset

Data summary

Cluster  Number of cells
1        9083
2        8977
3        526
4        201

Methodology

  1. Divide the data into two groups: the cluster of interest and all other clusters (e.g. cluster 1 vs. clusters 2, 3 and 4)
  2. Perform a principal component analysis (PCA) or multidimensional scaling (MDS) using all genes
  3. For each eigenvector, use a Mann-Whitney test to check whether the cell values differ significantly between the cluster of interest and the other clusters
  4. Adjust the p-values using a Bonferroni correction
  5. Keep the eigenvectors with adjusted p-values below 0.05
  6. Use the significant eigenvectors as features for prediction
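
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for this analysis: the function name `significant_components`, its parameters, the cells-by-genes layout of `expr`, and the choice of NumPy's SVD for the PCA and SciPy's `mannwhitneyu` for the test are all hypothetical. The classifier used in step 6 is not specified here, so the sketch stops at returning the feature matrix.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def significant_components(expr, labels, target, n_components=None, alpha=0.05):
    """Select principal components that separate `target` from all other clusters.

    expr   : cells x genes expression matrix (assumed layout)
    labels : cluster label of each cell
    target : label of the cluster of interest
    """
    # Step 1: one-vs-rest split of the cells.
    is_target = np.asarray(labels) == target

    # Step 2: PCA through an SVD of the gene-centered matrix.
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u * s  # coordinates of each cell on each component
    if n_components is not None:
        scores = scores[:, :n_components]

    # Step 3: two-sided Mann-Whitney test on every component.
    pvals = np.array([
        mannwhitneyu(scores[is_target, k], scores[~is_target, k],
                     alternative="two-sided").pvalue
        for k in range(scores.shape[1])
    ])

    # Step 4: Bonferroni correction (multiply by the number of tests, cap at 1).
    adjusted = np.minimum(pvals * len(pvals), 1.0)

    # Step 5: keep components with adjusted p-values below alpha.
    keep = np.flatnonzero(adjusted < alpha)

    # Step 6: the selected columns are the features passed to a classifier.
    return keep, scores[:, keep], adjusted
```

Calling `significant_components(expr, labels, target=1)` would correspond to the cluster 1 vs. clusters 2, 3 and 4 comparison. An MDS variant would replace only step 2.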

Performance results

Cluster 1

MDS distributions

Method  Significant eigenvectors  Cumulative variance
MDS     14                        24.13

PCA distributions

Method  Significant eigenvectors  Cumulative variance
PCA     9                         79.87

There is no clear difference between the distributions of the two groups (cluster 1 and the other clusters). This may explain why the prediction accuracy is only about 0.6 (see Performance section), close to the 0.5 expected when a cell is almost equally likely to belong to either group. The gain of about 0.1 in accuracy may be due to small regions where the distributions do not overlap and therefore discriminate cluster identity. This gain is related to an increase in sensitivity: as the number of eigenvectors considered for prediction increases, the target cluster becomes enriched in those regions (see the isolated and marginal red lines).

Performance

Dimension reduction methods exhibit better accuracy, specificity and kappa values when compared to the prediction based on DE genes.
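
For reference, the measures compared in the Performance sections can all be derived from a binary confusion matrix in which the cluster of interest is the positive class. The sketch below uses the standard definitions, with Cohen's kappa as the kappa statistic (an assumption, since the variant is not stated here). The function name `binary_metrics` and the counts in the usage line are hypothetical and are not results from this dataset.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Cohen's kappa from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # fraction of target-cluster cells recovered
    specificity = tn / (tn + fp)   # fraction of other-cluster cells rejected

    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Kappa corrects accuracy for chance agreement, which matters when the clusters differ greatly in size, as they do in the data summary.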

Cluster 2

MDS distributions

Method  Significant eigenvectors  Cumulative variance
MDS     11                        22.26

PCA distributions

Method  Significant eigenvectors  Cumulative variance
PCA     8                         3.29

Because the negative class (clusters 1, 3 and 4) shows many isolated distributions, specificity increases, which is the opposite of the behavior observed for cluster 1, where sensitivity increased.

Performance

Dimension reduction methods exhibit better accuracy, sensitivity and kappa values when compared to the prediction based on DE genes.

Cluster 3

MDS distributions

Method  Significant eigenvectors  Cumulative variance
MDS     16                        25.53

PCA distributions

Method  Significant eigenvectors  Cumulative variance
PCA     17                        80

Performance

Cluster 4

MDS distributions

Method  Significant eigenvectors  Cumulative variance
MDS     13                        19.16

PCA distributions

Method  Significant eigenvectors  Cumulative variance
PCA     12                        79.66

Performance